{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Spark DataFrames Project Exercise "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's get some quick practice with your new Spark DataFrame skills, you will be asked some basic questions about some stock market data, in this case Walmart Stock from the years 2012-2017. This exercise will just ask a bunch of questions, unlike the future machine learning exercises, which will be a little looser and be in the form of \"Consulting Projects\", but more on that later!\n",
    "\n",
    "For now, just answer the questions and complete the tasks below."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Use the walmart_stock.csv file to Answer and complete the  tasks below!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Start a simple Spark Session"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "from pyspark.sql import SparkSession\n",
    "\n",
    "spark = SparkSession.Builder().appName('Exercise').getOrCreate()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Load the Walmart Stock CSV File, have Spark infer the data types."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "+-------------------+------------------+---------+---------+------------------+--------+------------------+\n",
      "|               Date|              Open|     High|      Low|             Close|  Volume|         Adj Close|\n",
      "+-------------------+------------------+---------+---------+------------------+--------+------------------+\n",
      "|2012-01-03 00:00:00|         59.970001|61.060001|59.869999|         60.330002|12668800|52.619234999999996|\n",
      "|2012-01-04 00:00:00|60.209998999999996|60.349998|59.470001|59.709998999999996| 9593300|         52.078475|\n",
      "|2012-01-05 00:00:00|         59.349998|59.619999|58.369999|         59.419998|12768200|         51.825539|\n",
      "+-------------------+------------------+---------+---------+------------------+--------+------------------+\n",
      "only showing top 3 rows\n",
      "\n"
     ]
    }
   ],
   "source": [
    "df = spark.read.csv('walmart_stock.csv', inferSchema=True, header=True)\n",
    "df.show(3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### What are the column names?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['Date', 'Open', 'High', 'Low', 'Close', 'Volume', 'Adj Close']"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.columns"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### What does the Schema look like?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "root\n",
      " |-- Date: timestamp (nullable = true)\n",
      " |-- Open: double (nullable = true)\n",
      " |-- High: double (nullable = true)\n",
      " |-- Low: double (nullable = true)\n",
      " |-- Close: double (nullable = true)\n",
      " |-- Volume: integer (nullable = true)\n",
      " |-- Adj Close: double (nullable = true)\n",
      "\n"
     ]
    }
   ],
   "source": [
    "df.printSchema()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Print out the first 5 columns."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[Row(Date=datetime.datetime(2012, 1, 3, 0, 0), Open=59.970001, High=61.060001, Low=59.869999, Close=60.330002, Volume=12668800, Adj Close=52.619234999999996),\n",
       " Row(Date=datetime.datetime(2012, 1, 4, 0, 0), Open=60.209998999999996, High=60.349998, Low=59.470001, Close=59.709998999999996, Volume=9593300, Adj Close=52.078475),\n",
       " Row(Date=datetime.datetime(2012, 1, 5, 0, 0), Open=59.349998, High=59.619999, Low=58.369999, Close=59.419998, Volume=12768200, Adj Close=51.825539),\n",
       " Row(Date=datetime.datetime(2012, 1, 6, 0, 0), Open=59.419998, High=59.450001, Low=58.869999, Close=59.0, Volume=8069400, Adj Close=51.45922),\n",
       " Row(Date=datetime.datetime(2012, 1, 9, 0, 0), Open=59.029999, High=59.549999, Low=58.919998, Close=59.18, Volume=6679300, Adj Close=51.616215000000004)]"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.take(5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Use describe() to learn about the DataFrame."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 49,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "+-------+------------------+-----------------+-----------------+-----------------+-----------------+-----------------+\n",
      "|summary|              Open|             High|              Low|            Close|           Volume|        Adj Close|\n",
      "+-------+------------------+-----------------+-----------------+-----------------+-----------------+-----------------+\n",
      "|  count|              1258|             1258|             1258|             1258|             1258|             1258|\n",
      "|   mean| 72.35785375357709|72.83938807631165| 71.9186009594594|72.38844998012726|8222093.481717011|67.23883848728146|\n",
      "| stddev|  6.76809024470826|6.768186808159218|6.744075756255496|6.756859163732991|  4519780.8431556|6.722609449996857|\n",
      "|    min|56.389998999999996|        57.060001|        56.299999|        56.419998|          2094900|        50.363689|\n",
      "|    max|         90.800003|        90.970001|            89.25|        90.470001|         80898100|84.91421600000001|\n",
      "+-------+------------------+-----------------+-----------------+-----------------+-----------------+-----------------+\n",
      "\n"
     ]
    }
   ],
   "source": [
    "df.describe().show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Bonus Question!\n",
    "#### There are too many decimal places for mean and stddev in the describe() dataframe. Format the numbers to just show up to two decimal places. Pay careful attention to the datatypes that .describe() returns, we didn't cover how to do this exact formatting, but we covered something very similar. [Check this link for a hint](http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.Column.cast)\n",
    "\n",
    "If you get stuck on this, don't worry, just view the solutions."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 50,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "root\n",
      " |-- summary: string (nullable = true)\n",
      " |-- Open: string (nullable = true)\n",
      " |-- High: string (nullable = true)\n",
      " |-- Low: string (nullable = true)\n",
      " |-- Close: string (nullable = true)\n",
      " |-- Volume: string (nullable = true)\n",
      " |-- Adj Close: string (nullable = true)\n",
      "\n"
     ]
    }
   ],
   "source": [
    "df.describe().printSchema()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 51,
   "metadata": {},
   "outputs": [],
   "source": [
    "desc = df.describe()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 53,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "+-------+------------------+-----------------+-----------------+-----------------+-----------------+-----------------+\n",
      "|summary|              Open|             High|              Low|            Close|           Volume|        Adj Close|\n",
      "+-------+------------------+-----------------+-----------------+-----------------+-----------------+-----------------+\n",
      "|  count|              1258|             1258|             1258|             1258|             1258|             1258|\n",
      "|   mean| 72.35785375357709|72.83938807631165| 71.9186009594594|72.38844998012726|8222093.481717011|67.23883848728146|\n",
      "| stddev|  6.76809024470826|6.768186808159218|6.744075756255496|6.756859163732991|  4519780.8431556|6.722609449996857|\n",
      "|    min|56.389998999999996|        57.060001|        56.299999|        56.419998|          2094900|        50.363689|\n",
      "|    max|         90.800003|        90.970001|            89.25|        90.470001|         80898100|84.91421600000001|\n",
      "+-------+------------------+-----------------+-----------------+-----------------+-----------------+-----------------+\n",
      "\n"
     ]
    }
   ],
   "source": [
    "desc.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 57,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "+-------+--------+--------+--------+--------+-------------+\n",
      "|summary|    Open|    High|     Low|   Close|    Adj Close|\n",
      "+-------+--------+--------+--------+--------+-------------+\n",
      "|  count|1,258.00|1,258.00|1,258.00|1,258.00|     1,258.00|\n",
      "|   mean|   72.36|   72.84|   71.92|   72.39| 8,222,093.50|\n",
      "| stddev|    6.77|    6.77|    6.74|    6.76| 4,519,781.00|\n",
      "|    min|   56.39|   57.06|   56.30|   56.42| 2,094,900.00|\n",
      "|    max|   90.80|   90.97|   89.25|   90.47|80,898,096.00|\n",
      "+-------+--------+--------+--------+--------+-------------+\n",
      "\n"
     ]
    }
   ],
   "source": [
    "from pyspark.sql.functions import format_number\n",
    "\n",
    "desc.select(desc['summary'], format_number(desc['Open'].cast('float'), 2).alias('Open'), \n",
    "            format_number(desc['High'].cast('float'), 2).alias('High'),\n",
    "            format_number(desc['Low'].cast('float'), 2).alias('Low'), \n",
    "            format_number(desc['Close'].cast('float'), 2).alias('Close'),\n",
    "            format_number(desc['Volume'].cast('float'), 2).alias('Adj Close')).show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Create a new dataframe with a column called HV Ratio that is the ratio of the High Price versus volume of stock traded for a day."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 58,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "+-------------------+------------------+---------+---------+------------------+--------+------------------+\n",
      "|               Date|              Open|     High|      Low|             Close|  Volume|         Adj Close|\n",
      "+-------------------+------------------+---------+---------+------------------+--------+------------------+\n",
      "|2012-01-03 00:00:00|         59.970001|61.060001|59.869999|         60.330002|12668800|52.619234999999996|\n",
      "|2012-01-04 00:00:00|60.209998999999996|60.349998|59.470001|59.709998999999996| 9593300|         52.078475|\n",
      "|2012-01-05 00:00:00|         59.349998|59.619999|58.369999|         59.419998|12768200|         51.825539|\n",
      "+-------------------+------------------+---------+---------+------------------+--------+------------------+\n",
      "only showing top 3 rows\n",
      "\n"
     ]
    }
   ],
   "source": [
    "df.show(3)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 60,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "+--------------------+\n",
      "|            HV Ratio|\n",
      "+--------------------+\n",
      "|4.819714653321546E-6|\n",
      "|6.290848613094555E-6|\n",
      "|4.669412994783916E-6|\n",
      "|7.367338463826307E-6|\n",
      "|8.915604778943901E-6|\n",
      "|8.644477436914568E-6|\n",
      "|9.351828421515645E-6|\n",
      "| 8.29141562102703E-6|\n",
      "|7.712212102001476E-6|\n",
      "|7.071764823529412E-6|\n",
      "|1.015495466386981E-5|\n",
      "|6.576354146362592...|\n",
      "| 5.90145296180676E-6|\n",
      "|8.547679455011844E-6|\n",
      "|8.420709512685392E-6|\n",
      "|1.041448341728929...|\n",
      "|8.316075414862431E-6|\n",
      "|9.721183814992126E-6|\n",
      "|8.029436027707578E-6|\n",
      "|6.307432259386365E-6|\n",
      "+--------------------+\n",
      "only showing top 20 rows\n",
      "\n"
     ]
    }
   ],
   "source": [
    "df.withColumn('HV Ratio', df['High'] / df['Volume']).select('HV Ratio').show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### What day had the Peak High in Price?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 64,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "datetime.datetime(2015, 1, 13, 0, 0)"
      ]
     },
     "execution_count": 64,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.orderBy(df['High'].desc()).take(1)[0][0]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### What is the mean of the Close column?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "+-----------------+\n",
      "|       avg(Close)|\n",
      "+-----------------+\n",
      "|72.38844998012726|\n",
      "+-----------------+\n",
      "\n"
     ]
    }
   ],
   "source": [
    "from pyspark.sql.functions import avg\n",
    "df.select(avg('Close')).show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 65,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "+-----------------+\n",
      "|       avg(Close)|\n",
      "+-----------------+\n",
      "|72.38844998012726|\n",
      "+-----------------+\n",
      "\n"
     ]
    }
   ],
   "source": [
    "from pyspark.sql.functions import mean\n",
    "df.select(mean('Close')).show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### What is the max and min of the Volume column?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 66,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "+-----------+-----------+\n",
      "|max(Volume)|min(Volume)|\n",
      "+-----------+-----------+\n",
      "|   80898100|    2094900|\n",
      "+-----------+-----------+\n",
      "\n"
     ]
    }
   ],
   "source": [
    "from pyspark.sql.functions import min, max\n",
    "df.select(max(df['Volume']), min(df['Volume'])).show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### How many days was the Close lower than 60 dollars?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "81"
      ]
     },
     "execution_count": 33,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from pyspark.sql.functions import countDistinct\n",
    "df.filter(df['Close'] < 60).select(countDistinct(df['Date'])).collect()[0][0]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 68,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "81"
      ]
     },
     "execution_count": 68,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.filter(df['Close'] < 60).count()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### What percentage of the time was the High greater than 80 dollars ?\n",
    "#### In other words, (Number of Days High>80)/(Total Days in the dataset)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "9.141494435612083"
      ]
     },
     "execution_count": 37,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.filter(df['High'] > 80).select(countDistinct(df['Date'])).collect()[0][0] / df.select(countDistinct(df['Date'])).collect()[0][0] * 100"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 71,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "9.141494435612083"
      ]
     },
     "execution_count": 71,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.filter(df['High'] > 80).count() / df.count() * 100"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### What is the Pearson correlation between High and Volume?\n",
    "#### [Hint](http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameStatFunctions.corr)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "-0.3384326061737161"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.corr('High', 'Volume')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 74,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "+-------------------+\n",
      "| corr(High, Volume)|\n",
      "+-------------------+\n",
      "|-0.3384326061737161|\n",
      "+-------------------+\n",
      "\n"
     ]
    }
   ],
   "source": [
    "from pyspark.sql.functions import corr\n",
    "df.select(corr('High', 'Volume')).show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### What is the max High per year?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "+-------------------+------------------+---------+---------+------------------+--------+------------------+----+\n",
      "|               Date|              Open|     High|      Low|             Close|  Volume|         Adj Close|Year|\n",
      "+-------------------+------------------+---------+---------+------------------+--------+------------------+----+\n",
      "|2012-01-03 00:00:00|         59.970001|61.060001|59.869999|         60.330002|12668800|52.619234999999996|2012|\n",
      "|2012-01-04 00:00:00|60.209998999999996|60.349998|59.470001|59.709998999999996| 9593300|         52.078475|2012|\n",
      "|2012-01-05 00:00:00|         59.349998|59.619999|58.369999|         59.419998|12768200|         51.825539|2012|\n",
      "+-------------------+------------------+---------+---------+------------------+--------+------------------+----+\n",
      "only showing top 3 rows\n",
      "\n"
     ]
    }
   ],
   "source": [
    "from pyspark.sql.functions import year\n",
    "\n",
    "year_df = df.withColumn('Year', year(df['Date']))\n",
    "year_df.show(3)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "+----+---------+\n",
      "|Year|max(High)|\n",
      "+----+---------+\n",
      "|2015|90.970001|\n",
      "|2013|81.370003|\n",
      "|2014|88.089996|\n",
      "|2012|77.599998|\n",
      "|2016|75.190002|\n",
      "+----+---------+\n",
      "\n"
     ]
    }
   ],
   "source": [
    "year_df.groupBy('Year').max().select(['Year', 'max(High)']).show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### What is the average Close for each Calendar Month?\n",
    "#### In other words, across all the years, what is the average Close price for Jan,Feb, Mar, etc... Your result will have a value for each of these months. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "+-------------------+------------------+---------+---------+------------------+--------+------------------+\n",
      "|               Date|              Open|     High|      Low|             Close|  Volume|         Adj Close|\n",
      "+-------------------+------------------+---------+---------+------------------+--------+------------------+\n",
      "|2012-01-03 00:00:00|         59.970001|61.060001|59.869999|         60.330002|12668800|52.619234999999996|\n",
      "|2012-01-04 00:00:00|60.209998999999996|60.349998|59.470001|59.709998999999996| 9593300|         52.078475|\n",
      "|2012-01-05 00:00:00|         59.349998|59.619999|58.369999|         59.419998|12768200|         51.825539|\n",
      "+-------------------+------------------+---------+---------+------------------+--------+------------------+\n",
      "only showing top 3 rows\n",
      "\n"
     ]
    }
   ],
   "source": [
    "df.show(3)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "+-------------------+------------------+---------+---------+------------------+--------+------------------+-----+\n",
      "|               Date|              Open|     High|      Low|             Close|  Volume|         Adj Close|Month|\n",
      "+-------------------+------------------+---------+---------+------------------+--------+------------------+-----+\n",
      "|2012-01-03 00:00:00|         59.970001|61.060001|59.869999|         60.330002|12668800|52.619234999999996|    1|\n",
      "|2012-01-04 00:00:00|60.209998999999996|60.349998|59.470001|59.709998999999996| 9593300|         52.078475|    1|\n",
      "|2012-01-05 00:00:00|         59.349998|59.619999|58.369999|         59.419998|12768200|         51.825539|    1|\n",
      "+-------------------+------------------+---------+---------+------------------+--------+------------------+-----+\n",
      "only showing top 3 rows\n",
      "\n"
     ]
    }
   ],
   "source": [
    "from pyspark.sql.functions import month\n",
    "\n",
    "month_df = df.withColumn('Month', month(df['Date']))\n",
    "month_df.show(3)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "+-----+-----------------+\n",
      "|Month|       avg(Close)|\n",
      "+-----+-----------------+\n",
      "|   12|72.84792478301885|\n",
      "|    1|71.44801958415842|\n",
      "|    6| 72.4953774245283|\n",
      "|    3|71.77794377570092|\n",
      "|    5|72.30971688679247|\n",
      "|    9|72.18411785294116|\n",
      "|    4|72.97361900952382|\n",
      "|    8|73.02981855454546|\n",
      "|    7|74.43971943925233|\n",
      "|   10|71.57854545454543|\n",
      "|   11| 72.1110893069307|\n",
      "|    2|  71.306804443299|\n",
      "+-----+-----------------+\n",
      "\n"
     ]
    }
   ],
   "source": [
    "result = month_df.groupBy('Month').mean().select(['Month', 'avg(Close)'])\n",
    "result.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "+-----+-----------------+\n",
      "|Month|       avg(Close)|\n",
      "+-----+-----------------+\n",
      "|    1|71.44801958415842|\n",
      "|    2|  71.306804443299|\n",
      "|    3|71.77794377570092|\n",
      "|    4|72.97361900952382|\n",
      "|    5|72.30971688679247|\n",
      "|    6| 72.4953774245283|\n",
      "|    7|74.43971943925233|\n",
      "|    8|73.02981855454546|\n",
      "|    9|72.18411785294116|\n",
      "|   10|71.57854545454543|\n",
      "|   11| 72.1110893069307|\n",
      "|   12|72.84792478301885|\n",
      "+-----+-----------------+\n",
      "\n"
     ]
    }
   ],
   "source": [
    "result.orderBy(result['Month']).show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Great Job!"
   ]
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "kernelspec": {
   "display_name": "conda_python3",
   "language": "python",
   "name": "conda_python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}
